Various improvements to text fingerprinting

Authors

  • Djamal Belazzougui
  • Roman Kolpakov
  • Mathieu Raffinot
Abstract

Let s = s_1..s_n be a text (or sequence) over a finite alphabet Σ of size σ. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set F of all fingerprints of all substrings of s in order to efficiently answer certain queries on this set. A substring s_i..s_j is a maximal location for a fingerprint f ∈ F (denoted by 〈i, j〉) if the alphabet of s_i..s_j is f and s_{i−1}, s_{j+1}, if defined, are not in f. The set of maximal locations in s is denoted by L (it is easy to see that |L| ≤ nσ). Two maximal locations 〈i, j〉 and 〈k, l〉 such that s_i..s_j = s_k..s_l are called copies, and the quotient set of L under the copy relation is denoted by L_C. We first present new exact efficient algorithms and data structures for the following three problems: (1) to compute F; (2) given f as a set of distinct characters in Σ, to decide whether f is a fingerprint in F; (3) given f, to find all maximal locations of f in s. As in papers on succinct data structures, all space complexities in this paper are counted in bits, and all logarithms are base two. Problem 1 is solved either in O(n + |L_C| log σ) worst-case time using O((n + |L_C| + |F| log σ) log n) bits of space, or in O(n + |L| log σ) randomized expected time using O((n + |F| log σ) log n) bits of space. Problem 2 is solved either in O(|f|) expected time if only O(|f| log n) bits of working space for queries is allowed, or in worst-case O(|f|/ε) time if a working space of O(σ^ε log n) bits is allowed (with ε a constant satisfying 0 < ε < 1). These algorithms use a data structure that occupies |F|(2 log σ + log_2 e)(1 + o(1)) bits. Problem 3 is solved with the same time complexity as Problem 2, but with an additional occ term in each of the complexities, where occ is the number of maximal locations corresponding to the given fingerprint. Our solution to this last problem requires a data structure that occupies O((n + |L_C|) log n) bits of memory. In the second part of the paper we present a novel Monte Carlo approximate construction approach. With it, Problem 1 is solved in O(n + |L|) expected time using O(|F| log n) bits of space, but the algorithm is incorrect with an extremely small probability that can be bounded in advance.
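To make the definitions concrete, the following is a minimal brute-force sketch in Python that enumerates the maximal locations 〈i, j〉 of a short text and groups them by fingerprint. It runs in quadratic time and is only an illustration of the definitions, not one of the algorithms of the paper; the function names `maximal_locations`, `is_fingerprint`, and `copy_classes` are hypothetical and chosen for this example.

```python
from collections import defaultdict


def maximal_locations(s):
    """Map each fingerprint (a frozenset of characters) to its maximal
    locations, given as 1-based inclusive pairs (i, j)."""
    n = len(s)
    locations = defaultdict(list)
    for i in range(n):
        seen = set()  # alphabet of s[i..j], grown as j advances
        for j in range(i, n):
            seen.add(s[j])
            # The character to the left must lie outside the fingerprint.
            # Once it is inside, it stays inside for every larger j.
            if i > 0 and s[i - 1] in seen:
                break
            # The character to the right must lie outside the fingerprint.
            if j == n - 1 or s[j + 1] not in seen:
                locations[frozenset(seen)].append((i + 1, j + 1))
    return dict(locations)


def is_fingerprint(locations, f):
    """Problem 2, naively: is the character set f a fingerprint of the text?"""
    return frozenset(f) in locations


def copy_classes(s, locations):
    """Group maximal locations that spell the same substring (copies)."""
    classes = defaultdict(list)
    for spans in locations.values():
        for i, j in spans:
            classes[s[i - 1:j]].append((i, j))
    return dict(classes)


if __name__ == "__main__":
    text = "abaca"
    locs = maximal_locations(text)
    for f, spans in locs.items():
        print(sorted(f), spans)
    print(is_fingerprint(locs, "bc"))  # no substring has alphabet {b, c}
```

Every fingerprint has at least one maximal location, since any substring can be extended left and right while its alphabet stays the same, so the keys of the returned dictionary form the set F. Here Problem 2 is answered with an ordinary hash lookup on that dictionary, which is a stand-in for the succinct structure described in the abstract.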

Journal:
  • J. Discrete Algorithms

Volume 22

Publication date: 2013